Introduction

Theoretical Foundation on Spectral Analysis of Protein Length Distributions

The Spectral Analysis of Distributions (SAD) methodology developed by Kolker et al. (2002) addresses a fundamental limitation of classical Fourier analysis when applied to biological sequence data. Traditional discrete Fourier transforms (DFT) require that the data interval contains an integral number of periods, which constrains period detection to discrete harmonics of the fundamental frequency.
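To make the harmonic constraint concrete, here is a small R sketch. The window size N = 551 is an assumption chosen to match the 50–600 aa analysis range used later; a DFT over N points can only probe periods of the form N/k:

```r
# Periods accessible to a DFT over N points are restricted to N/k
N <- 551                    # e.g. integer lengths 50..600 (assumed window)
dft_periods <- N / (1:10)   # first ten harmonics of the fundamental
round(dft_periods, 1)       # 551, 275.5, 183.7, 137.8, ... -- nothing in between
```

Any genuine periodicity falling between two such harmonics (say, at 125 aa) cannot be resolved exactly by the DFT grid, which motivates fitting cosines with continuously adjustable periods instead.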

Technical Innovation: SAD overcomes this constraint by fitting cosine waves with continuously adjustable periods to empirical distributions, enabling detection of non-integer periodicities that may reflect genuine biological constraints.

Rationale for Exploration: Protein domains may exhibit preferred sizes, potentially reflecting optimal folding units, functional constraints, or evolutionary modularity. These constraints may manifest as “periodic peaks” in length distributions at multiples of a fundamental domain size.

Simpler Explanation for Students: SAD helps us detect these hidden patterns in protein length data. Think of protein domains like building blocks - if there’s a “preferred size” block that evolution favors, we should see proteins that are 1x, 2x, 3x, or 4x this preferred size more often than random lengths.

This document implements the SAD methodology across four eukaryotic enzyme datasets, maintaining methodological consistency while handling diverse data structures.

Background - Methods for Preprocessing Non-Redundant Proteins

Non-Redundancy and Statistical Power

Theoretical Foundation: The Kolker et al. (OMICS, 2002) study addressed potential bias from overrepresented protein families by creating a non-redundant UniProt sequence dataset. This step prevents large, heavily duplicated families from dominating the length distribution and inflating apparent periodicity.

Methodological Approach: The following R scripts follow the paper’s SAD and statistical mixture modeling protocol: we remove all proteins with identical descriptions, retaining only the longest variant. This conservative approach ensures that each unique protein function is represented only once, providing a more robust test of genuine length periodicity.

# Function to create non-redundant dataset
create_nonredundant_dataset <- function(proteins) {
  # Analyze redundancy based on protein names
  redundancy_summary <- proteins %>%
    group_by(protein_name_std) %>%
    summarise(
      count = n(),
      min_length = min(length_aa, na.rm = TRUE),
      max_length = max(length_aa, na.rm = TRUE),
      length_variants = n_distinct(length_aa),
      organism_count = n_distinct(organism_name, na.rm = TRUE),
      .groups = 'drop'
    )
  
  # Create non-redundant dataset by keeping longest variant
  nonredundant_proteins <- proteins %>%
    group_by(protein_name_std) %>%
    slice_max(length_aa, with_ties = FALSE) %>%
    ungroup()
  
  return(list(
    nonredundant = nonredundant_proteins,
    redundancy_summary = redundancy_summary
  ))
}
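As a quick sanity check, the keep-the-longest rule can be exercised on a toy table in base R, without dplyr (the protein names below are purely illustrative):

```r
# Toy data: two variants of the same standardized name plus one singleton
toy <- data.frame(
  protein_name_std = c("kinase a", "kinase a", "phosphatase b"),
  length_aa        = c(310, 295, 420)
)

# For each name keep the single longest variant, mirroring
# slice_max(length_aa, with_ties = FALSE) in the function above
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$protein_name_std),
                      function(idx) idx[which.max(toy$length_aa[idx])]))
nonredundant <- toy[sort(keep), ]
nonredundant$length_aa    # the 295-aa duplicate of "kinase a" is dropped
```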

Core SAD Methodology: From Theory to Implementation

Mathematical Foundation: The SAD algorithm implements the three-step process of Kolker et al. (OMICS, 2002) to detect periodicities in length distributions:

Step 1: Background Removal via Moving Average

Equation 1 from Paper: For each period j, calculate the non-oscillating background:

Nonosc_j(i) = (1/j) × Σ[k=-int(j/2) to int(j/2)] Total(i+k)

Technical Purpose: This period-specific smoothing eliminates any oscillatory component at frequency j, leaving only the non-periodic background. The window size equals the test period, ensuring complete removal of that specific periodicity.

Biological Interpretation: This step removes the underlying “baseline” protein length distribution, isolating only the periodic components that might reflect evolutionary domain constraints.
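On a toy count vector, Equation 1 is simply a centered moving average whose window matches the test period. The counts below are made up for illustration; `stats::filter` performs the window sum:

```r
j <- 5                                            # test period
Total <- c(4, 6, 9, 7, 5, 8, 12, 10, 6, 7, 11)    # toy length counts
half_j <- floor(j / 2)

# Centered window of 2*int(j/2)+1 counts, divided by j as in Equation 1
Nonosc <- stats::filter(Total, rep(1, 2 * half_j + 1) / j, sides = 2)
as.numeric(Nonosc)   # NA at the ends, where the window is incomplete
```

This is why the SAD code below restricts each period's interval by half a window on both ends: positions without a complete window have no defined background.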

Step 2: Oscillation Extraction

Equation 2 from Paper:

Osc_j(i) = Total(i) - Nonosc_j(i)

Purpose: Extract the residual oscillatory component after background removal.

Step 3: Amplitude Calculation via Cosine Transform

Equation 4 from Paper: Calculate the amplitude as a Fourier coefficient:

A_j = Σ[Osc_j(i) × cos(2πi/j)] / Σ[cos²(2πi/j)]

Mathematical Insight: This is essentially projecting the oscillatory signal onto a cosine basis function of period j. The amplitude A_j quantifies how much the data oscillates at that specific period.

Continuous Spectrum Advantage: Unlike discrete FFT, this approach tests periods at unit increments (j = 2, 3, 4, …, 160), providing a virtually continuous spectrum that can detect non-integer periodicities.

Simple Explanation: Imagine looking for a repeating beat in music. Step 1 removes the background noise, Step 2 isolates the rhythm, and Step 3 measures how strong the beat is at each possible tempo. We test many different tempos to find the strongest rhythm.
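One way to see Equation 4: A_j is exactly the least-squares coefficient from regressing the oscillation onto a cosine of period j with no intercept. A toy check (the signal amplitudes 2.5 and 0.8 are arbitrary):

```r
i <- 50:200
j <- 30
# Toy oscillation: a period-30 cosine plus an unrelated component
osc  <- 2.5 * cos(2 * pi * i / j) + 0.8 * sin(2 * pi * i / 17)
cosv <- cos(2 * pi * i / j)

A_j  <- sum(osc * cosv) / sum(cosv^2)      # Equation 4
A_lm <- unname(coef(lm(osc ~ 0 + cosv)))   # no-intercept regression slope
c(A_j, A_lm)                               # agree; both near 2.5
```

The unrelated period-17 component is nearly orthogonal to the cosine regressor, so it barely perturbs A_j, which is the point of projecting onto one period at a time.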

# Function to perform Spectral Analysis of Distributions (SAD)
sad_analysis_kolker <- function(data_vector, min_period = 2, max_period = 160, 
                               min_length = 50, max_length = 600) {
  # Create a frequency table of counts by length
  imin <- min_length
  imax <- max_length
  
  # Create a Total vector where Total[i] is the count of proteins with length i
  Total <- numeric(imax - imin + 1)
  names(Total) <- imin:imax
  
  for (len in data_vector) {
    if (len >= imin && len <= imax) {
      Total[as.character(len)] <- Total[as.character(len)] + 1
    }
  }
  
  # Initialize results vectors
  periods <- min_period:max_period
  amplitudes <- numeric(length(periods))
  
  # For each period to test
  for (p_idx in seq_along(periods)) {
    j <- periods[p_idx]
    
    # Define the interval excluding half-periods from both ends
    half_j <- floor(j/2)
    interval_start <- imin + half_j
    interval_end <- imax - half_j
    
    # Calculate the number of complete periods in the interval
    m <- floor((interval_end - interval_start) / j) - 1
    
    if (m < 1) {
      amplitudes[p_idx] <- 0
      next
    }
    
    # 1. Calculate non-oscillating background using weighted moving average
    Nonosc <- numeric(imax - imin + 1)
    names(Nonosc) <- imin:imax
    
    for (i in interval_start:interval_end) {
      i_str <- as.character(i)
      
      # Sum over window of size j centered at i
      window_sum <- 0
      
      for (k in -half_j:half_j) {
        idx <- as.character(i + k)
        if (idx %in% names(Total)) {
          window_sum <- window_sum + Total[idx]
        }
      }
      
      # Calculate the non-oscillating part
      Nonosc[i_str] <- window_sum / j
    }
    
    # 2. Calculate oscillating component
    Osc <- numeric(imax - imin + 1)
    names(Osc) <- imin:imax
    
    for (i in interval_start:interval_end) {
      i_str <- as.character(i)
      Osc[i_str] <- Total[i_str] - Nonosc[i_str]
    }
    
    # 3. Apply cosine Fourier transform
    valid_indices <- as.character(interval_start:interval_end)
    osc_values <- Osc[valid_indices]
    lengths <- as.numeric(valid_indices)
    cos_values <- cos(2 * pi * lengths / j)
    
    # Calculate amplitude
    numerator <- sum(osc_values * cos_values)
    denominator <- sum(cos_values^2)
    
    if (denominator > 0) {
      amplitudes[p_idx] <- numerator / denominator
    } else {
      amplitudes[p_idx] <- 0
    }
  }
  
  return(data.frame(period = periods, amplitude = amplitudes))
}
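To check that the three steps can recover a known period, here is a self-contained condensed version of the same algorithm run on synthetic counts with a planted 125-aa oscillation (the background shape and 0.3 modulation depth are arbitrary choices):

```r
imin <- 50; imax <- 600
i <- imin:imax
# Synthetic counts: smooth gamma-shaped background modulated at period 125
Total <- round(5e5 * dgamma(i, shape = 3, scale = 120) *
                 (1 + 0.3 * cos(2 * pi * i / 125)))

sad_amp <- function(j) {
  half_j <- floor(j / 2)
  keep <- (half_j + 1):(length(i) - half_j)   # positions with a full window
  Nonosc <- stats::filter(Total, rep(1, 2 * half_j + 1) / j, sides = 2)  # Step 1
  Osc <- (Total - Nonosc)[keep]                                          # Step 2
  cosv <- cos(2 * pi * i[keep] / j)
  sum(Osc * cosv) / sum(cosv^2)                                          # Step 3
}

periods <- 2:160
amps <- sapply(periods, sad_amp)
periods[which.max(amps)]   # the spectrum peaks at or near the planted 125 aa
```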

Statistical Mixture Modeling: Quantifying Biological Significance

Theoretical Framework: Beyond period detection, Kolker et al. developed a probabilistic mixture model to test the statistical significance of observed periodicities. This model combines:

Background Component: Gamma Distribution

Mathematical Formulation: Uses a gamma distribution G(α, β) to model the baseline protein length distribution.

Biological Rationale: Gamma distributions can accommodate various skewness patterns commonly observed in protein length data, providing flexible baseline modeling.

Periodic Components: Normal Distributions

Model Structure: Peaks at multiples μ, 2μ, 3μ, 4μ with increasing variance:

- Peak 1: N(μ, σ²)
- Peak 2: N(2μ, 2σ²)
- Peak 3: N(3μ, 3σ²)
- Peak 4: N(4μ, 4σ²)

Biological Interpretation: This models proteins composed of “i subunits”, each with mean length μ and standard deviation σ. The increasing variance reflects the compounding uncertainty with multiple domains.
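The variance rule can be verified by simulation: the sum of i independent subunit lengths with mean μ and standard deviation σ has mean iμ and standard deviation √i·σ (μ = 125 and σ = 15 below are illustrative values only):

```r
set.seed(7)
mu <- 125; sigma <- 15
# Simulate 1e5 proteins, each built from i = 3 subunits
units <- matrix(rnorm(3 * 1e5, mean = mu, sd = sigma), ncol = 3)
totals <- rowSums(units)
c(mean(totals), sd(totals))   # ~ (3*mu, sqrt(3)*sigma) = (375, ~26)
```

Note that the sd grows like √i, not i, which is why the higher-order peaks in the mixture broaden only gradually.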

Maximum Likelihood Estimation

Parameter Set: The model estimates (k+4) parameters: μ, σ, α, β, p₁, p₂, p₃, p₄.

Constraint: Peak probabilities must sum to less than 1: Σpᵢ < 1.

Likelihood Ratio Test

Statistical Test: Compares the unconstrained model L₁ against the background-only model L₀.

Test Statistic: λ = -2 ln(L₀/L₁) ~ χ²(k+2) under the null hypothesis.

Null Hypothesis: Protein lengths follow a gamma distribution without periodic peaks.

Simple Explanation: We build two models - one that allows for preferred protein sizes (with peaks) and one that doesn’t (background only). If the “peaks” model is significantly better, we have evidence for preferred domain sizes. The likelihood ratio test tells us how confident we can be in this conclusion.
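The test statistic reduces to a difference of log-likelihoods. A minimal sketch of the mechanics, using hypothetical log-likelihood values (not taken from any dataset in this document):

```r
logL0 <- -41250.7   # background-only model log-likelihood (hypothetical)
logL1 <- -41120.3   # k = 4 peak mixture model log-likelihood (hypothetical)
k <- 4

lambda <- -2 * (logL0 - logL1)   # equivalently 2 * (logL1 - logL0)
p_value <- pchisq(lambda, df = k + 2, lower.tail = FALSE)
c(lambda = lambda, p_value = p_value)   # lambda = 260.8, p far below 0.05
```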

# Normalized gamma PDF for discrete distributions
gamma_pdf_normalized <- function(x, alpha, beta, imin, imax) {
  if (alpha <= 0 || beta <= 0) return(rep(1e-10, length(x)))
  
  x_values <- imin:imax
  raw_pdf <- dgamma(x_values, shape = alpha + 1, scale = beta)
  normalized_pdf <- raw_pdf / sum(raw_pdf)
  result <- normalized_pdf[x - imin + 1]
  return(result)
}

# Normalized normal PDF for discrete distributions
normal_pdf_normalized <- function(x, mu, sigma, imin, imax) {
  if (sigma <= 0) return(rep(1e-10, length(x)))
  
  x_values <- imin:imax
  raw_pdf <- dnorm(x_values, mean = mu, sd = sigma)
  normalized_pdf <- raw_pdf / sum(raw_pdf)
  result <- normalized_pdf[x - imin + 1]
  return(result)
}
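A quick check that the discrete normalization behaves as intended: renormalizing over the analysis window makes the density sum to 1, so truncation to [imin, imax] is handled consistently for both components (the shape and scale values here are arbitrary):

```r
imin <- 50; imax <- 600
x <- imin:imax
alpha <- 2; beta <- 150    # arbitrary illustration values

pdf <- dgamma(x, shape = alpha + 1, scale = beta)
pdf <- pdf / sum(pdf)      # discrete normalization over the window
sum(pdf)                   # exactly 1 by construction
```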

# Mixture model fitting function with robust error handling
fit_mixture_model_kolker <- function(length_counts, period_hint, k, imin, imax) {
  lengths <- as.numeric(names(length_counts))
  counts <- as.numeric(length_counts)
  
  if (length(lengths) == 0 || length(counts) == 0) {
    warning("Empty length counts data provided")
    return(list(
      params = rep(NA, 4 + k), mu = NA, sigma = NA, alpha = NA, beta = NA,
      p_values = rep(NA, k), mu_background = NA, sigma_background = NA,
      mu_pure_background = NA, sigma_pure_background = NA,
      convergence = 1, log_likelihood = NA, background_log_likelihood = NA,
      lambda = NA, p_value = NA
    ))
  }
  
  # Initial parameter estimation
  initial_mu <- ifelse(is.null(period_hint) || is.na(period_hint) || period_hint <= 0, 100, period_hint)
  initial_sigma <- max(1, initial_mu / 10)
  
  mean_val <- sum(lengths * counts) / sum(counts)
  var_val <- sum(counts * (lengths - mean_val)^2) / sum(counts)
  
  initial_beta <- max(1, var_val / mean_val)
  if (is.na(initial_beta) || !is.finite(initial_beta)) initial_beta <- 200
  
  initial_alpha <- max(0.1, (mean_val / initial_beta) - 1)
  if (is.na(initial_alpha) || !is.finite(initial_alpha)) initial_alpha <- 1
  
  p_init <- rep(0.05, k)
  initial_params <- c(initial_mu, initial_sigma, initial_alpha, initial_beta, p_init)
  
  # Negative log-likelihood function
  mixture_nll <- function(params, lengths, counts, k, imin, imax) {
    mu <- params[1]
    sigma <- params[2]
    alpha <- params[3]
    beta <- params[4]
    p_values <- params[5:(4+k)]
    
    if (mu <= 0 || mu > 160 || sigma <= 0 || sigma > 100 || 
        alpha < 0 || alpha > 10 || beta <= 0 || beta > 1000 || 
        any(p_values < 0) || any(p_values > 1) || sum(p_values) >= 1) {
      return(1e10)
    }
    
    tryCatch({
      g_pdf <- gamma_pdf_normalized(lengths, alpha, beta, imin, imax)
      mixture_pdf <- (1 - sum(p_values)) * g_pdf
      
      for (i in 1:k) {
        peak_pdf <- normal_pdf_normalized(lengths, i*mu, sqrt(i)*sigma, imin, imax)
        mixture_pdf <- mixture_pdf + p_values[i] * peak_pdf
      }
      
      mixture_pdf <- pmax(mixture_pdf, 1e-10)
      ll <- sum(counts * log(mixture_pdf))
      return(-ll)
    }, error = function(e) {
      return(1e10)
    })
  }
  
  # Fit full model
  fit <- tryCatch({
    optim(
      par = initial_params,
      fn = mixture_nll,
      lengths = lengths, counts = counts, k = k, imin = imin, imax = imax,
      method = "L-BFGS-B",
      lower = c(20, 1, 0.01, 1, rep(0.001, k)),
      upper = c(160, 50, 10, 1000, rep(0.2, k)),
      control = list(maxit = 1000)
    )
  }, error = function(e) {
    list(par = initial_params, value = 1e10, convergence = 1)
  })
  
  # Fit background-only model
  bg_nll <- function(params, lengths, counts, imin, imax) {
    alpha <- params[1]
    beta <- params[2]
    
    if (alpha < 0 || beta <= 0) return(1e10)
    
    tryCatch({
      g_pdf <- gamma_pdf_normalized(lengths, alpha, beta, imin, imax)
      ll <- sum(counts * log(pmax(g_pdf, 1e-10)))
      return(-ll)
    }, error = function(e) {
      return(1e10)
    })
  }
  
  bg_fit <- tryCatch({
    optim(
      par = c(initial_alpha, initial_beta),
      fn = bg_nll,
      lengths = lengths, counts = counts, imin = imin, imax = imax,
      method = "L-BFGS-B",
      lower = c(0.01, 1),
      upper = c(10, 1000)
    )
  }, error = function(e) {
    list(par = c(initial_alpha, initial_beta), value = 1e10, convergence = 1)
  })
  
  # Calculate likelihood ratio test
  lambda <- if(fit$value < 1e10 && bg_fit$value < 1e10) {
    2 * (bg_fit$value - fit$value)
  } else { NA }
  
  df <- k + 2
  p_value <- if(!is.na(lambda)) { pchisq(lambda, df = df, lower.tail = FALSE) } else { NA }
  
  # Extract parameters
  params <- fit$par
  mu <- params[1]; sigma <- params[2]; alpha <- params[3]; beta <- params[4]
  p_values <- params[5:(4+k)]
  
  # Calculate derived parameters
  mu_background <- if(!is.na(alpha) && !is.na(beta)) beta * (alpha + 1) else NA
  sigma_background <- if(!is.na(alpha) && !is.na(beta)) beta * sqrt(alpha + 1) else NA
  mu_pure_background <- if(bg_fit$convergence == 0) bg_fit$par[2] * (bg_fit$par[1] + 1) else NA
  sigma_pure_background <- if(bg_fit$convergence == 0) bg_fit$par[2] * sqrt(bg_fit$par[1] + 1) else NA
  
  return(list(
    params = params, mu = mu, sigma = sigma, alpha = alpha, beta = beta,
    p_values = p_values, mu_background = mu_background, sigma_background = sigma_background,
    mu_pure_background = mu_pure_background, sigma_pure_background = sigma_pure_background,
    convergence = fit$convergence, log_likelihood = -fit$value,
    background_log_likelihood = -bg_fit$value, lambda = lambda, p_value = p_value
  ))
}
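As a reduced usage sketch of the same optimization machinery, the background-only fit can be exercised end-to-end on simulated data. With the shape = α + 1 parameterization above, integer lengths drawn from a gamma with shape 3 and scale 120 should be recovered near (α, β) = (2, 120):

```r
set.seed(42)
# Simulated integer lengths from the background model, truncated to the window
lengths_sim <- round(rgamma(5000, shape = 3, scale = 120))
lengths_sim <- lengths_sim[lengths_sim >= 50 & lengths_sim <= 600]
counts <- table(lengths_sim)
x <- as.numeric(names(counts)); n <- as.numeric(counts)

nll <- function(par) {                  # par = c(alpha, beta)
  grid <- 50:600
  pdf <- dgamma(grid, shape = par[1] + 1, scale = par[2])
  pdf <- pdf / sum(pdf)                 # discrete, window-normalized
  -sum(n * log(pmax(pdf[x - 50 + 1], 1e-10)))
}

bg_fit <- optim(c(1, 200), nll, method = "L-BFGS-B",
                lower = c(0.01, 1), upper = c(10, 1000),
                control = list(maxit = 500))
bg_fit$par   # close to (2, 120)
```

Because the likelihood is normalized over the truncated window, the truncation itself does not bias the estimates.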

Dataset Configuration and Analysis Parameters

## Analysis Configuration:
## ======================
## Datasets to analyze: 4
## Length range: 50 - 600 amino acids
## Period range: 2 - 160 amino acids
## Number of peaks (k): 4

Uniprot Dataset Standardization and Loading

Data Harmonization Strategy

Theoretical Context: The original Kolker et al. OMICS 2002 study used SWISS-PROT Release 39.16 (2000) with specific filtering criteria: eukaryotic sequences, removal of fragments, retention of only enzymes with EC numbers, and creation of non-redundant datasets by protein description.

Implementation Challenge: Modern UniProt datasets are more comprehensive, allowing more sequences to be analyzed. UniProt sequences were obtained from two different sources; each source was also preprocessed, giving rise to four datasets. In this analysis we used a standardization function to ensure that the correct columns were used for all four datasets, enforcing methodological consistency while preserving the essential data elements (i.e., length values) required for SAD analysis.

Key Data Requirements:

- Entry identifiers for protein tracking
- Length values (amino acid counts) for distribution analysis
- EC numbers for enzyme classification
- Organism information for taxonomic filtering
- Protein sequences/names for redundancy removal
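The standardize_dataset() function used in the execution steps below is not reproduced in this chunk. A minimal base-R sketch of the idea; the column headers and the helper name standardize_columns are assumptions for illustration:

```r
standardize_columns <- function(df) {
  # Map UniProt export headers onto the names used downstream
  data.frame(
    entry            = df[["Entry"]],
    protein_name_std = tolower(trimws(df[["Protein names"]])),
    organism_name    = df[["Organism"]],
    length_aa        = as.integer(df[["Length"]]),
    stringsAsFactors = FALSE
  )
}

toy <- data.frame(Entry = c("P1", "P2"),
                  `Protein names` = c(" Kinase A ", "kinase a"),
                  Organism = c("Homo sapiens", "Mus musculus"),
                  Length = c(320, 298),
                  check.names = FALSE)
standardize_columns(toy)   # both names standardize to "kinase a"
```

Lower-casing and trimming the protein name is what makes the later redundancy removal by protein_name_std robust to cosmetic differences between exports.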

Step 1. Analysis Pipeline Overview

Integrated Analytical Workflow

Pipeline Design: This section integrates the SAD spectral analysis with mixture model fitting to provide comprehensive period detection and statistical validation, following the complete methodology from Kolker et al. OMICS 2002.

Dual Dataset Analysis: Following the paper’s approach, we analyze both complete and non-redundant datasets to assess the robustness of any detected periodicities. Consistent results across both datasets strengthen confidence in biological significance.

Parameter Integration: The preferred period detected by SAD serves as an informed prior for the mixture model’s fundamental period μ, linking the spectral and probabilistic approaches.

# Function to perform complete analysis
analyze_protein_lengths <- function(proteins_df, dataset_name, 
                                   min_length = 50, max_length = 600,
                                   min_period = 2, max_period = 160, k_peaks = 4) {
  
  # Create non-redundant dataset
  nr_result <- create_nonredundant_dataset(proteins_df)
  nonredundant_proteins <- nr_result$nonredundant
  
  # Filter by length range
  filtered_proteins <- proteins_df %>%
    filter(length_aa >= min_length & length_aa <= max_length)
  
  filtered_nonredundant <- nonredundant_proteins %>%
    filter(length_aa >= min_length & length_aa <= max_length)
  
  cat("Analysis Summary for", dataset_name, ":\n")
  cat("Total proteins:", nrow(proteins_df), "\n")
  cat("Nonredundant proteins:", nrow(nonredundant_proteins), "\n")
  cat("Proteins", min_length, "-", max_length, "aa:", nrow(filtered_proteins), 
      "(", round(nrow(filtered_proteins)/nrow(proteins_df)*100, 1), "%)\n")
  cat("Nonredundant proteins", min_length, "-", max_length, "aa:", nrow(filtered_nonredundant), 
      "(", round(nrow(filtered_nonredundant)/nrow(nonredundant_proteins)*100, 1), "%)\n\n")
  
  # Apply SAD analysis
  cat("Running SAD analysis on entire dataset...\n")
  sad_results_all <- sad_analysis_kolker(filtered_proteins$length_aa, 
                                        min_period, max_period, min_length, max_length)
  
  cat("Running SAD analysis on nonredundant dataset...\n")
  sad_results_nonredundant <- sad_analysis_kolker(filtered_nonredundant$length_aa, 
                                                 min_period, max_period, min_length, max_length)
  
  # Find preferred periods
  preferred_period_all <- sad_results_all$period[which.max(sad_results_all$amplitude)]
  preferred_period_nonredundant <- sad_results_nonredundant$period[which.max(sad_results_nonredundant$amplitude)]
  
  cat("Preferred period (entire dataset):", preferred_period_all, "aa\n")
  cat("Preferred period (nonredundant dataset):", preferred_period_nonredundant, "aa\n\n")
  
  # Prepare length counts for mixture model
  length_counts_all <- table(filtered_proteins$length_aa)
  length_counts_nonredundant <- table(filtered_nonredundant$length_aa)
  
  # Fit mixture models
  cat("Fitting mixture model to entire dataset...\n")
  model_results_all <- fit_mixture_model_kolker(length_counts_all, preferred_period_all, 
                                               k_peaks, min_length, max_length)
  
  cat("Fitting mixture model to nonredundant dataset...\n")
  model_results_nonredundant <- fit_mixture_model_kolker(length_counts_nonredundant, 
                                                        preferred_period_nonredundant, 
                                                        k_peaks, min_length, max_length)
  
  return(list(
    filtered_proteins = filtered_proteins,
    filtered_nonredundant = filtered_nonredundant,
    sad_results_all = sad_results_all,
    sad_results_nonredundant = sad_results_nonredundant,
    model_results_all = model_results_all,
    model_results_nonredundant = model_results_nonredundant,
    length_counts_all = length_counts_all,
    length_counts_nonredundant = length_counts_nonredundant,
    preferred_period_all = preferred_period_all,
    preferred_period_nonredundant = preferred_period_nonredundant
  ))
}

Step 2. Overview Visualization Functions Based on Previously Published Studies

Data Visualization: Reproducing the Kolker et al. OMICS 2002 article analysis and Figures

Figure Replication Strategy: Our visualization functions strive to recreate the key figures from the original OMICS article (Figures 1, 4, and 3) to enable direct comparison with published results and validate our implementation.

Length Distribution Plot (Figure 1 equivalent)

Technical Details: Shows raw histogram with 41-amino acid smoothing window, matching the paper’s visualization approach. The smoothing reveals underlying trends while preserving the periodic structure.

Cosine Spectrum Plot (Figure 4 equivalent)

Interpretation Guide: Peak amplitude indicates periodicity strength. The highest peak identifies the fundamental period, with the x-axis value showing the preferred domain size in amino acids.

Probability Density Plot (Figure 3 equivalent)

Model Comparison: Overlays observed data with fitted mixture model and background-only model. Vertical lines mark the fundamental period and its multiples (1×, 2×, 3×, 4×).

Visual Validation: Good model fit appears as close agreement between the blue fitted line and black data bars, with clear peaks at the predicted periodic positions.

# Function to plot length distribution
plot_length_distribution <- function(data_vector, dataset_name, min_length = 50, max_length = 600) {
  filtered_data <- data_vector[data_vector >= min_length & data_vector <= max_length]
  # Count proteins at each integer length directly (avoids histogram bin-edge ambiguity)
  length_table <- table(factor(filtered_data, levels = min_length:max_length))
  
  window_size <- 41
  half_window <- floor(window_size / 2)
  
  plot_data <- data.frame(
    length = min_length:max_length,
    count = as.numeric(length_table)
  )
  
  smoothed_counts <- numeric(nrow(plot_data))
  for (i in 1:nrow(plot_data)) {
    start_idx <- max(1, i - half_window)
    end_idx <- min(nrow(plot_data), i + half_window)
    smoothed_counts[i] <- mean(plot_data$count[start_idx:end_idx])
  }
  
  par(mar = c(5, 5, 4, 2) + 0.1, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
  
  plot(plot_data$length, plot_data$count, type = 'h', 
       main = paste("Distribution of Enzyme Lengths:", dataset_name, "\n(Non-Redundant Dataset)"),
       xlab = "Protein Length", ylab = "Number of Proteins",
       xlim = c(min_length, max_length), ylim = c(0, max(plot_data$count) * 1.1),
       col = "darkblue", lwd = 1)
  
  lines(plot_data$length, smoothed_counts, col = "red", lwd = 2)
  
  legend("topright", 
         legend = c("Raw Distribution", "Smoothed (41-aa window)"),
         col = c("darkblue", "red"), lty = c(1, 1), lwd = c(1, 2),
         bg = "white", cex = 1.2)
  
  # return(list(raw = plot_data, smoothed = smoothed_counts)) # keeps RMD output concise.
}

# Function to plot cosine spectrum
plot_cosine_spectrum <- function(sad_results, dataset_name, max_period = 160) {
  plot_data <- sad_results[sad_results$period <= max_period, ]
  max_period_val <- plot_data$period[which.max(plot_data$amplitude)]
  
  par(mar = c(5, 5, 4, 2) + 0.1, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
  
  plot(plot_data$period, plot_data$amplitude, type = 'l',
       main = paste("Cosine Spectrum:", dataset_name),
       xlab = "Period (amino acids)", ylab = "Amplitude",
       xlim = c(0, max_period), col = "blue", lwd = 2)
  
  # return(max_period_val) # keeping RMD output concise
}

# Function to create probability density plot
create_density_plot <- function(model_results, length_counts, dataset_name,
                               min_length = 50, max_length = 600, k = 4) {
  
  mu <- model_results$mu; sigma <- model_results$sigma
  alpha <- model_results$alpha; beta <- model_results$beta
  p_values <- model_results$p_values; p_value <- model_results$p_value
  
  if (any(is.na(c(mu, sigma, alpha, beta))) || any(is.na(p_values))) {
    warning("Invalid model parameters for ", dataset_name)
    return(NULL)
  }
  
  lengths <- min_length:max_length
  
  tryCatch({
    g_pdf <- gamma_pdf_normalized(lengths, alpha, beta, min_length, max_length)
    background_only <- g_pdf
    full_model <- (1 - sum(p_values)) * g_pdf
    
    for (i in 1:k) {
      peak_pdf <- normal_pdf_normalized(lengths, i*mu, sqrt(i)*sigma, min_length, max_length)
      full_model <- full_model + p_values[i] * peak_pdf
    }
    
    observed <- numeric(length(lengths))
    names(observed) <- as.character(lengths)
    
    for (len in names(length_counts)) {
      if (as.numeric(len) >= min_length && as.numeric(len) <= max_length) {
        observed[len] <- length_counts[len]
      }
    }
    
    total_obs <- sum(observed)
    if (total_obs > 0) observed <- observed / total_obs
    
    par(mar = c(5, 5, 4, 2) + 0.1, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
    
    plot(lengths, observed, type = 'h', 
         main = paste("Probability Density:", dataset_name, "\n(Non-Redundant Dataset)"),
         xlab = "Protein Length", ylab = "Probability Density",
         xlim = c(min_length, max_length), 
         ylim = c(0, max(c(observed, full_model, background_only), na.rm = TRUE) * 1.1),
         col = "black", lwd = 1.2)
    
    lines(lengths, full_model, col = "blue", lwd = 2.5)
    lines(lengths, background_only, col = "red", lwd = 2, lty = 2)
    
    if (!is.na(mu)) {
      abline(v = c(mu, 2*mu, 3*mu, 4*mu), col = "blue", lty = 3, lwd = 1.5)
      text(c(mu, 2*mu, 3*mu, 4*mu), rep(0, 4), c("1×", "2×", "3×", "4×"), 
           pos = 3, col = "blue", cex = 1.2)
    }
    
    legend("topright", 
           legend = c("Data", "Estimated model", "Background only"),
           col = c("black", "blue", "red"), lty = c(1, 1, 2), lwd = c(1, 2.5, 2),
           bg = "white", cex = 1.2)
    
    mtext(paste("Period =", round(mu, 2), "aa (p-value =", format(p_value, scientific = TRUE, digits = 2), ")"), 
          side = 1, line = 3, cex = 1.2)
    
    # return(list(lengths = lengths, observed = observed, full_model = full_model,   # keeping RMD output concise
    #            background_only = background_only))
  }, error = function(e) {
    warning(paste("Error creating density plot for", dataset_name, ":", e$message))
    return(NULL)
  })
}

Step 3. - Dataset 1 - Analysis Pipeline Execution

# New dataset obtained by GS 9/7/25; originally loaded as read_tsv("uniprotkb_taxonomy_id_2759_AND_reviewed_unique_2.tsv")

cat("Loading dataset:", datasets[1], "\n")
## Loading dataset: dataset_1.csv
# Read the dataset
proteins_raw_1 <- read_csv(datasets[1], show_col_types = FALSE)
cat("Raw dataset dimensions:", nrow(proteins_raw_1), "x", ncol(proteins_raw_1), "\n")
## Raw dataset dimensions: 14685 x 7
# Standardize the dataset
proteins_1 <- standardize_dataset(proteins_raw_1, dataset_names[1])
## Dataset: Dataset_1 
## Original columns: Entry, Reviewed, Entry Name, Protein names, Gene Names, Organism, Length 
## Rows before filtering: 14685 
## Rows after filtering: 14685 
## Has EC number column: FALSE 
## Has Sequence column: FALSE 
## Has SeqID column: FALSE
cat("Standardized dataset dimensions:", nrow(proteins_1), "x", ncol(proteins_1), "\n")
## Standardized dataset dimensions: 14685 x 19
# Run the complete analysis
results_1 <- analyze_protein_lengths(proteins_1, dataset_names[1], 
                                   analysis_params$min_length, analysis_params$max_length,
                                   analysis_params$min_period, analysis_params$max_period, 
                                   analysis_params$k_peaks)
## Analysis Summary for Dataset_1 :
## Total proteins: 14685 
## Nonredundant proteins: 14685 
## Proteins 50 - 600 aa: 14685 ( 100 %)
## Nonredundant proteins 50 - 600 aa: 14685 ( 100 %)
## 
## Running SAD analysis on entire dataset...
## Running SAD analysis on nonredundant dataset...
## Preferred period (entire dataset): 123 aa
## Preferred period (nonredundant dataset): 123 aa
## 
## Fitting mixture model to entire dataset...
## Fitting mixture model to nonredundant dataset...
# Store results for summary
all_results[[1]] <- results_1

Step 3.1 - Dataset 1 - Eukaryotic Protein Length Distribution - Figure S3a

Step 3.2 - Dataset 1 - Cosine Spectrum - Figure S4a

Step 3.3 - Dataset 1 - Probability Density

Step 3.4 - Dataset 1 - Statistical Modeling Summary - component of Table 5

Table 5 Processed Datasets: Statistical Parameters and p-values (non-redundant)

## ===== STATISTICAL PARAMETERS FOR DATASET_1 =====
##                           All Proteins         Non-redundant
## μ_pure_background        482.5149             482.5149
## σ_pure_background        213.0603             213.0603
## μ_background             450.2763             450.2763
## σ_background             193.0135             193.0135
## μ (fundamental period)   125.2119             125.2119
## σ                        15.7515              15.7515
## p1                        0.0108               0.0108              
## p2                        0.0010               0.0010              
## p3                        0.0498               0.0498              
## p4                        0.0850               0.0850
## p-value                   2.36e-51             2.36e-51

Step 4. - Dataset 2 - Analysis Pipeline Execution

## Loading dataset: dataset_2.csv
## Raw dataset dimensions: 13328 x 7
## Dataset: Dataset_2 
## Original columns: Entry, Reviewed, Entry Name, Protein names, Gene Names, Organism, Length 
## Rows before filtering: 13328 
## Rows after filtering: 13328 
## Has EC number column: FALSE 
## Has Sequence column: FALSE 
## Has SeqID column: FALSE
## Standardized dataset dimensions: 13328 x 19
## Analysis Summary for Dataset_2 :
## Total proteins: 13328 
## Nonredundant proteins: 13328 
## Proteins 50 - 600 aa: 13328 ( 100 %)
## Nonredundant proteins 50 - 600 aa: 13328 ( 100 %)
## 
## Running SAD analysis on entire dataset...
## Running SAD analysis on nonredundant dataset...
## Preferred period (entire dataset): 123 aa
## Preferred period (nonredundant dataset): 123 aa
## 
## Fitting mixture model to entire dataset...
## Fitting mixture model to nonredundant dataset...

Step 4.1 - Dataset 2 - Eukaryotic Protein Length Distribution - Figure 3a

Step 4.2 - Dataset 2 - Cosine Spectrum - Figure 4a

Step 4.3 - Dataset 2 - Probability Density

Step 4.4 - Dataset 2 - Statistical Modeling Summary - component of Table 5

## ===== STATISTICAL PARAMETERS FOR DATASET_2 =====
##                           All Proteins         Non-redundant
## μ_pure_background        484.6348             484.6348
## σ_pure_background        214.6755             214.6755
## μ_background             420.2115             420.2115
## σ_background             162.8579             162.8579
## μ (fundamental period)   129.0371             129.0371
## σ                        17.7137              17.7137
## p1                        0.0216               0.0216              
## p2                        0.0010               0.0010              
## p3                        0.0278               0.0278              
## p4                        0.1072               0.1072
## p-value                   1.31e-50             1.31e-50

Step 5. - Dataset 3 - Analysis Pipeline Execution

# Independent reviewer 3 dataset originally called: "uniprotkb_taxonomy_id_2759_AND_ec_AND_r_2025_06_27.csv" was renamed dataset_3

cat("Loading dataset:", datasets[3], "\n")
## Loading dataset: dataset_3.csv
# Read the dataset
proteins_raw_3 <- read_csv(datasets[3], show_col_types = FALSE)
cat("Raw dataset dimensions:", nrow(proteins_raw_3), "x", ncol(proteins_raw_3), "\n")
## Raw dataset dimensions: 27600 x 10
# Standardize the dataset
proteins_3 <- standardize_dataset(proteins_raw_3, dataset_names[3])
## Dataset: Dataset_3 
## Original columns: Entry, Reviewed, Entry Name, Protein names, Gene Names, Organism, Length, Sequence, Gene encoded by, EC number 
## Rows before filtering: 27600 
## Rows after filtering: 27600 
## Has EC number column: TRUE 
## Has Sequence column: TRUE 
## Has SeqID column: FALSE
cat("Standardized dataset dimensions:", nrow(proteins_3), "x", ncol(proteins_3), "\n")
## Standardized dataset dimensions: 27600 x 22
# Run the complete analysis
results_3 <- analyze_protein_lengths(proteins_3, dataset_names[3], 
                                   analysis_params$min_length, analysis_params$max_length,
                                   analysis_params$min_period, analysis_params$max_period, 
                                   analysis_params$k_peaks)
## Analysis Summary for Dataset_3 :
## Total proteins: 27600 
## Nonredundant proteins: 21813 
## Proteins 50 - 600 aa: 18985 ( 68.8 %)
## Nonredundant proteins 50 - 600 aa: 14881 ( 68.2 %)
## 
## Running SAD analysis on entire dataset...
## Running SAD analysis on nonredundant dataset...
## Preferred period (entire dataset): 123 aa
## Preferred period (nonredundant dataset): 123 aa
## 
## Fitting mixture model to entire dataset...
## Fitting mixture model to nonredundant dataset...
# Store results for summary
all_results[[3]] <- results_3

Step 5.1 - Dataset 3 - Eukaryotic Protein Length Distribution - Figure S3b

Step 5.2 - Dataset 3 - Cosine Spectrum - Figure S4b

Step 5.3 - Dataset 3 - Probability Density

Step 5.4 - Dataset 3 - Statistical Modeling Parameter Summary

## ===== STATISTICAL PARAMETERS FOR DATASET_3 =====
##                           All Proteins         Non-redundant
## μ_pure_background        498.7061             500.2965
## σ_pure_background        240.2406             234.2714
## μ_background             507.5532             499.7194
## σ_background             265.8698             245.8420
## μ (fundamental period)   123.0974             122.9788
## σ                        17.5746              15.7036
## p1                        0.0017               0.0036              
## p2                        0.0067               0.0058              
## p3                        0.0957               0.0771              
## p4                        0.0941               0.0821
## p-value                   1.54e-75             3.05e-55

Step 6. - Dataset 4 - Analysis Pipeline Execution

Source file: filtered_enzyme_dataset_clusterize80.csv

cat("Loading dataset:", datasets[4], "\n")
## Loading dataset: dataset_4.csv
# Read the dataset
proteins_raw_4 <- read_csv(datasets[4], show_col_types = FALSE)
cat("Raw dataset dimensions:", nrow(proteins_raw_4), "x", ncol(proteins_raw_4), "\n")
## Raw dataset dimensions: 20252 x 11
# Standardize the dataset
proteins_4 <- standardize_dataset(proteins_raw_4, dataset_names[4])
## Dataset: Dataset_4 
## Original columns: Entry, Reviewed, Entry Name, Protein names, Gene Names, Organism, Length, Sequence, Gene encoded by, EC number, SeqID 
## Rows before filtering: 20252 
## Rows after filtering: 20252 
## Has EC number column: TRUE 
## Has Sequence column: TRUE 
## Has SeqID column: TRUE
cat("Standardized dataset dimensions:", nrow(proteins_4), "x", ncol(proteins_4), "\n")
## Standardized dataset dimensions: 20252 x 23
# Run the complete analysis
results_4 <- analyze_protein_lengths(proteins_4, dataset_names[4], 
                                   analysis_params$min_length, analysis_params$max_length,
                                   analysis_params$min_period, analysis_params$max_period, 
                                   analysis_params$k_peaks)
## Analysis Summary for Dataset_4 :
## Total proteins: 20252 
## Nonredundant proteins: 18063 
## Proteins 50 - 600 aa: 13980 ( 69 %)
## Nonredundant proteins 50 - 600 aa: 12378 ( 68.5 %)
## 
## Running SAD analysis on entire dataset...
## Running SAD analysis on nonredundant dataset...
## Preferred period (entire dataset): 123 aa
## Preferred period (nonredundant dataset): 123 aa
## 
## Fitting mixture model to entire dataset...
## Fitting mixture model to nonredundant dataset...
# Store results for summary
all_results[[4]] <- results_4

Step 6.1 - Dataset 4 - Eukaryotic Protein Length Distribution - Figure 3b

Step 6.2 - Dataset 4 - Cosine Spectrum - Figure 4b

Step 6.3 - Dataset 4 - Probability Density

Step 6.4 - Dataset 4 - Statistical Mixture Modeling Summary - Table 5 component

cat("===== STATISTICAL PARAMETERS FOR", toupper(dataset_names[4]), "=====\n")
## ===== STATISTICAL PARAMETERS FOR DATASET_4 =====
cat(sprintf("%-25s %-20s %-20s\n", "", "All Proteins", "Non-redundant"))
##                           All Proteins         Non-redundant
cat(sprintf("%-25s %-20.4f %-20.4f\n", "μ_pure_background", 
            results_4$model_results_all$mu_pure_background, 
            results_4$model_results_nonredundant$mu_pure_background))
## μ_pure_background        487.7593             489.0297
cat(sprintf("%-25s %-20.4f %-20.4f\n", "σ_pure_background", 
            results_4$model_results_all$sigma_pure_background, 
            results_4$model_results_nonredundant$sigma_pure_background))
## σ_pure_background        222.0246             218.8946
cat(sprintf("%-25s %-20.4f %-20.4f\n", "μ_background", 
            results_4$model_results_all$mu_background, 
            results_4$model_results_nonredundant$mu_background))
## μ_background             510.0662             509.4094
cat(sprintf("%-25s %-20.4f %-20.4f\n", "σ_background", 
            results_4$model_results_all$sigma_background, 
            results_4$model_results_nonredundant$sigma_background))
## σ_background             252.8820             247.7414
cat(sprintf("%-25s %-20.4f %-20.4f\n", "μ (fundamental period)", 
            results_4$model_results_all$mu, 
            results_4$model_results_nonredundant$mu))
## μ (fundamental period)   122.1311             122.4846
cat(sprintf("%-25s %-20.4f %-20.4f\n", "σ", 
            results_4$model_results_all$sigma, 
            results_4$model_results_nonredundant$sigma))
## σ                        16.7743              17.5077
for (i in 1:analysis_params$k_peaks) {
  cat(sprintf("%-25s %-20.4f %-20.4f\n", paste0("p", i), 
              results_4$model_results_all$p_values[i], 
              results_4$model_results_nonredundant$p_values[i]))
}
## p1                        0.0010               0.0010              
## p2                        0.0147               0.0173              
## p3                        0.0860               0.0874              
## p4                        0.0845               0.0894
cat(sprintf("%-25s %-20s %-20s\n", "p-value", 
            format(results_4$model_results_all$p_value, scientific = TRUE, digits = 3), 
            format(results_4$model_results_nonredundant$p_value, scientific = TRUE, digits = 3)))
## p-value                   6.18e-41             3.23e-33

Step 7. - Consolidated Statistical Parameters, All 4 Datasets - Table 5

# Create comprehensive statistical parameters table
statistical_params_table <- data.frame(
  Parameter = c("μ_pure_background", "σ_pure_background", "μ_background", "σ_background", 
                "μ (fundamental period)", "σ", "p1", "p2", "p3", "p4", "p-value"),
  
  # Dataset_1_All = c(
  #  round(all_results[[1]]$model_results_all$mu_pure_background, 4),
  #  round(all_results[[1]]$model_results_all$sigma_pure_background, 4),
  #  round(all_results[[1]]$model_results_all$mu_background, 4),
  #  round(all_results[[1]]$model_results_all$sigma_background, 4),
  #  round(all_results[[1]]$model_results_all$mu, 4),
  #  round(all_results[[1]]$model_results_all$sigma, 4),
  #  round(all_results[[1]]$model_results_all$p_values[1], 4),
  #  round(all_results[[1]]$model_results_all$p_values[2], 4),
  #  round(all_results[[1]]$model_results_all$p_values[3], 4),
  #  round(all_results[[1]]$model_results_all$p_values[4], 4),
  #  format(all_results[[1]]$model_results_all$p_value, scientific = TRUE, digits = 3)
  # ),
  
  Dataset_1_NR = c(
    round(all_results[[1]]$model_results_nonredundant$mu_pure_background, 4),
    round(all_results[[1]]$model_results_nonredundant$sigma_pure_background, 4),
    round(all_results[[1]]$model_results_nonredundant$mu_background, 4),
    round(all_results[[1]]$model_results_nonredundant$sigma_background, 4),
    round(all_results[[1]]$model_results_nonredundant$mu, 4),
    round(all_results[[1]]$model_results_nonredundant$sigma, 4),
    round(all_results[[1]]$model_results_nonredundant$p_values[1], 4),
    round(all_results[[1]]$model_results_nonredundant$p_values[2], 4),
    round(all_results[[1]]$model_results_nonredundant$p_values[3], 4),
    round(all_results[[1]]$model_results_nonredundant$p_values[4], 4),
    format(all_results[[1]]$model_results_nonredundant$p_value, scientific = TRUE, digits = 3)
  ),
  
  # Dataset_2_All = c(
  #  round(all_results[[2]]$model_results_all$mu_pure_background, 4),
  #  round(all_results[[2]]$model_results_all$sigma_pure_background, 4),
  #  round(all_results[[2]]$model_results_all$mu_background, 4),
  #  round(all_results[[2]]$model_results_all$sigma_background, 4),
  #  round(all_results[[2]]$model_results_all$mu, 4),
  #  round(all_results[[2]]$model_results_all$sigma, 4),
  #  round(all_results[[2]]$model_results_all$p_values[1], 4),
  #  round(all_results[[2]]$model_results_all$p_values[2], 4),
  #  round(all_results[[2]]$model_results_all$p_values[3], 4),
  #  round(all_results[[2]]$model_results_all$p_values[4], 4),
  #  format(all_results[[2]]$model_results_all$p_value, scientific = TRUE, digits = 3)
  # ),
  
  Dataset_2_NR = c(
    round(all_results[[2]]$model_results_nonredundant$mu_pure_background, 4),
    round(all_results[[2]]$model_results_nonredundant$sigma_pure_background, 4),
    round(all_results[[2]]$model_results_nonredundant$mu_background, 4),
    round(all_results[[2]]$model_results_nonredundant$sigma_background, 4),
    round(all_results[[2]]$model_results_nonredundant$mu, 4),
    round(all_results[[2]]$model_results_nonredundant$sigma, 4),
    round(all_results[[2]]$model_results_nonredundant$p_values[1], 4),
    round(all_results[[2]]$model_results_nonredundant$p_values[2], 4),
    round(all_results[[2]]$model_results_nonredundant$p_values[3], 4),
    round(all_results[[2]]$model_results_nonredundant$p_values[4], 4),
    format(all_results[[2]]$model_results_nonredundant$p_value, scientific = TRUE, digits = 3)
  ),
  
  # Dataset_3_All = c(
  #  round(all_results[[3]]$model_results_all$mu_pure_background, 4),
  #  round(all_results[[3]]$model_results_all$sigma_pure_background, 4),
  #  round(all_results[[3]]$model_results_all$mu_background, 4),
  #  round(all_results[[3]]$model_results_all$sigma_background, 4),
  #  round(all_results[[3]]$model_results_all$mu, 4),
  #  round(all_results[[3]]$model_results_all$sigma, 4),
  #  round(all_results[[3]]$model_results_all$p_values[1], 4),
  #  round(all_results[[3]]$model_results_all$p_values[2], 4),
  #  round(all_results[[3]]$model_results_all$p_values[3], 4),
  #  round(all_results[[3]]$model_results_all$p_values[4], 4),
  #  format(all_results[[3]]$model_results_all$p_value, scientific = TRUE, digits = 3)
  # ),
  
  Dataset_3_NR = c(
    round(all_results[[3]]$model_results_nonredundant$mu_pure_background, 4),
    round(all_results[[3]]$model_results_nonredundant$sigma_pure_background, 4),
    round(all_results[[3]]$model_results_nonredundant$mu_background, 4),
    round(all_results[[3]]$model_results_nonredundant$sigma_background, 4),
    round(all_results[[3]]$model_results_nonredundant$mu, 4),
    round(all_results[[3]]$model_results_nonredundant$sigma, 4),
    round(all_results[[3]]$model_results_nonredundant$p_values[1], 4),
    round(all_results[[3]]$model_results_nonredundant$p_values[2], 4),
    round(all_results[[3]]$model_results_nonredundant$p_values[3], 4),
    round(all_results[[3]]$model_results_nonredundant$p_values[4], 4),
    format(all_results[[3]]$model_results_nonredundant$p_value, scientific = TRUE, digits = 3)
  ),
  
  # Dataset_4_All = c(
  #  round(all_results[[4]]$model_results_all$mu_pure_background, 4),
  #  round(all_results[[4]]$model_results_all$sigma_pure_background, 4),
  #  round(all_results[[4]]$model_results_all$mu_background, 4),
  #  round(all_results[[4]]$model_results_all$sigma_background, 4),
  #  round(all_results[[4]]$model_results_all$mu, 4),
  #  round(all_results[[4]]$model_results_all$sigma, 4),
  #  round(all_results[[4]]$model_results_all$p_values[1], 4),
  #  round(all_results[[4]]$model_results_all$p_values[2], 4),
  #  round(all_results[[4]]$model_results_all$p_values[3], 4),
  #  round(all_results[[4]]$model_results_all$p_values[4], 4),
  #  format(all_results[[4]]$model_results_all$p_value, scientific = TRUE, digits = 3)
  # ),
  
  Dataset_4_NR = c(
    round(all_results[[4]]$model_results_nonredundant$mu_pure_background, 4),
    round(all_results[[4]]$model_results_nonredundant$sigma_pure_background, 4),
    round(all_results[[4]]$model_results_nonredundant$mu_background, 4),
    round(all_results[[4]]$model_results_nonredundant$sigma_background, 4),
    round(all_results[[4]]$model_results_nonredundant$mu, 4),
    round(all_results[[4]]$model_results_nonredundant$sigma, 4),
    round(all_results[[4]]$model_results_nonredundant$p_values[1], 4),
    round(all_results[[4]]$model_results_nonredundant$p_values[2], 4),
    round(all_results[[4]]$model_results_nonredundant$p_values[3], 4),
    round(all_results[[4]]$model_results_nonredundant$p_values[4], 4),
    format(all_results[[4]]$model_results_nonredundant$p_value, scientific = TRUE, digits = 3)
  )
)

# Display the comprehensive statistical parameters table
kable(statistical_params_table, 
      col.names = c("Parameter", "NR", "NR", "NR", "NR"),
      caption = "Comprehensive Statistical Parameters: All Four Datasets") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 10) %>%
  add_header_above(c(" " = 1, "Dataset 1" = 1, "Dataset 2" = 1, "Dataset 3" = 1, "Dataset 4" = 1)) %>%
  column_spec(1, bold = TRUE, width = "3cm") %>%
  column_spec(2:5, width = "1.5cm")
Comprehensive Statistical Parameters: All Four Datasets

Parameter                Dataset 1 (NR)   Dataset 2 (NR)   Dataset 3 (NR)   Dataset 4 (NR)
μ_pure_background        482.5149         484.6348         500.2965         489.0297
σ_pure_background        213.0603         214.6755         234.2714         218.8946
μ_background             450.2763         420.2115         499.7194         509.4094
σ_background             193.0135         162.8579         245.8420         247.7414
μ (fundamental period)   125.2119         129.0371         122.9788         122.4846
σ                        15.7515          17.7137          15.7036          17.5077
p1                       0.0108           0.0216           0.0036           0.0010
p2                       0.0010           0.0010           0.0058           0.0173
p3                       0.0498           0.0278           0.0771           0.0874
p4                       0.0850           0.1072           0.0821           0.0894
p-value                  2.36e-51         1.31e-50         3.05e-55         3.23e-33

Step 8. - Combined Probability Density Plots - Supplemental Figure SX

# Create ggplot versions of density plots for patchwork
create_ggplot_density <- function(model_results, length_counts, dataset_name, 
                                 min_length = 50, max_length = 600, k = 4) {
  
  mu <- model_results$mu; sigma <- model_results$sigma
  alpha <- model_results$alpha; beta <- model_results$beta
  p_values <- model_results$p_values; p_value <- model_results$p_value
  
  if (any(is.na(c(mu, sigma, alpha, beta))) || any(is.na(p_values))) {
    return(ggplot() + ggtitle(paste("Error:", dataset_name)) + theme_void())
  }
  
  lengths <- min_length:max_length
  
  tryCatch({
    g_pdf <- gamma_pdf_normalized(lengths, alpha, beta, min_length, max_length)
    background_only <- g_pdf
    full_model <- (1 - sum(p_values)) * g_pdf
    
    for (i in 1:k) {
      peak_pdf <- normal_pdf_normalized(lengths, i*mu, sqrt(i)*sigma, min_length, max_length)
      full_model <- full_model + p_values[i] * peak_pdf
    }
    
    observed <- numeric(length(lengths))
    names(observed) <- as.character(lengths)
    
    for (len in names(length_counts)) {
      if (as.numeric(len) >= min_length && as.numeric(len) <= max_length) {
        observed[len] <- length_counts[len]
      }
    }
    
    total_obs <- sum(observed)
    if (total_obs > 0) observed <- observed / total_obs
    
    # Create plot data
    plot_data <- data.frame(
      length = lengths,
      observed = observed,
      full_model = full_model,
      background_only = background_only
    )
    
    p <- ggplot(plot_data, aes(x = length)) +
      geom_col(aes(y = observed), color = "black", fill = "lightgray", width = 0.8) +
      geom_line(aes(y = full_model), color = "blue", linewidth = 1.2) +
      geom_line(aes(y = background_only), color = "red", linewidth = 1, linetype = "dashed") +
      labs(title = paste("Probability Density:", dataset_name),
           subtitle = paste("μ =", round(mu, 2), "aa, p =", format(p_value, scientific = TRUE, digits = 2)),
           x = "Protein Length (aa)", 
           y = "Probability Density") +
      xlim(min_length, max_length) +
      theme_minimal() +
      theme(plot.title = element_text(size = 12, hjust = 0.5),
            plot.subtitle = element_text(size = 10, hjust = 0.5),
            axis.title = element_text(size = 10),
            axis.text = element_text(size = 9))
    
    # Add periodic markers
    if (!is.na(mu)) {
      p <- p + geom_vline(xintercept = c(mu, 2*mu, 3*mu, 4*mu), 
                         color = "blue", linetype = "dotted", alpha = 0.7)
    }
    
    return(p)
    
  }, error = function(e) {
    return(ggplot() + ggtitle(paste("Error:", dataset_name)) + theme_void())
  })
}

# Create individual density plots
p1_density <- create_ggplot_density(all_results[[1]]$model_results_nonredundant, 
                                   all_results[[1]]$length_counts_nonredundant, 
                                   dataset_names[1], analysis_params$min_length, 
                                   analysis_params$max_length, analysis_params$k_peaks)

p2_density <- create_ggplot_density(all_results[[2]]$model_results_nonredundant, 
                                   all_results[[2]]$length_counts_nonredundant, 
                                   dataset_names[2], analysis_params$min_length, 
                                   analysis_params$max_length, analysis_params$k_peaks)

p3_density <- create_ggplot_density(all_results[[3]]$model_results_nonredundant, 
                                   all_results[[3]]$length_counts_nonredundant, 
                                   dataset_names[3], analysis_params$min_length, 
                                   analysis_params$max_length, analysis_params$k_peaks)

p4_density <- create_ggplot_density(all_results[[4]]$model_results_nonredundant, 
                                   all_results[[4]]$length_counts_nonredundant, 
                                   dataset_names[4], analysis_params$min_length, 
                                   analysis_params$max_length, analysis_params$k_peaks)

# Combine with patchwork
combined_density <- (p1_density | p2_density) / (p3_density | p4_density)
combined_density + plot_annotation(
  title = "Comparative Probability Density Models: All Four Datasets (Non-redundant)",
  subtitle = "Mixture Model Fitting with Periodic Components vs Background-Only Models",
  theme = theme(plot.title = element_text(size = 16, hjust = 0.5),
                plot.subtitle = element_text(size = 12, hjust = 0.5))
)

Step 9. - Advancing Statistical Analysis by Team C - Extended from OMICS 2002 Paper

## STATISTICAL MIXTURE MODEL: Extension of OMICS 2002 Framework
## OMICS 2002 introduced probabilistic mixture model combining:
## 1. Smooth background distribution (gamma family for flexibility with skewness)
## 2. Individual peaks at preferred lengths μ and multiples 2μ, 3μ, ..., kμ
## 3. Peak strengths p1, p2, ..., pk and standard deviations σ, √2σ, ..., √kσ

## BIOLOGICAL INTERPRETATION: "protein chain composed of i subunits, each with
## mean length μ and standard deviation σ, statistically independent"
## This reflects domain-based organization of eukaryotic proteins
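The mixture density described in these comments can be written out directly. A minimal sketch, using illustrative (not fitted) parameter values, checking that the mixture is a proper probability density:

```r
# Minimal sketch of the OMICS 2002-style mixture density described above.
# All parameter values here are illustrative, not fitted estimates.
mixture_density <- function(x, alpha, beta, mu, sigma, p) {
  p0 <- 1 - sum(p)  # remaining weight goes to the gamma background
  dens <- p0 * dgamma(x, shape = alpha, scale = beta)
  for (i in seq_along(p)) {
    # i-th peak at i*mu with standard deviation sqrt(i)*sigma
    dens <- dens + p[i] * dnorm(x, mean = i * mu, sd = sqrt(i) * sigma)
  }
  dens
}

f <- function(x) mixture_density(x, alpha = 1.5, beta = 200,
                                 mu = 125, sigma = 15,
                                 p = c(0.02, 0.01, 0.05, 0.09))

# A mixture of normalized components integrates to ~1 over (0, Inf)
integrate(f, 0, Inf)$value
```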

## MODERN STATISTICAL EXTENSION: Maximum Likelihood with Constrained Optimization
## OMICS 2002 used "method of maximum likelihood" but modern implementation
## uses robust optimization algorithms (L-BFGS-B) with parameter constraints

# Log-likelihood for full mixture model (gamma background + 4 normal peaks)
log_likelihood_mixture <- function(params, lengths) {
  alpha <- params[1]    # gamma shape
  beta <- params[2]     # gamma scale  
  mu <- params[3]       # base period
  sigma <- params[4]    # base std dev
  p1 <- params[5]       # weight for 1x peak
  p2 <- params[6]       # weight for 2x peak
  p3 <- params[7]       # weight for 3x peak
  p4 <- params[8]       # weight for 4x peak
  
  # Constraint: mixture weights must sum to <= 1; use a large finite penalty
  # rather than Inf, because L-BFGS-B requires finite objective values
  if (p1 + p2 + p3 + p4 > 1) return(1e10)
  
  p0 <- 1 - (p1 + p2 + p3 + p4)  # background weight
  
  ## MIXTURE MODEL COMPONENTS: Following OMICS 2002 Statistical Framework
  ## Background: Gamma distribution for flexible background shapes
  ## Peaks: Normal distributions at multiples of base period μ
  ## Standard deviations scale as √i*σ for i-th multiple (central limit theorem)
  
  # Component PDFs
  background_pdf <- dgamma(lengths, shape = alpha, scale = beta)
  peak1_pdf <- dnorm(lengths, mean = mu, sd = sigma)
  peak2_pdf <- dnorm(lengths, mean = 2*mu, sd = sqrt(2)*sigma)
  peak3_pdf <- dnorm(lengths, mean = 3*mu, sd = sqrt(3)*sigma)
  peak4_pdf <- dnorm(lengths, mean = 4*mu, sd = sqrt(4)*sigma)
  
  # Mixture PDF
  mixture_pdf <- p0 * background_pdf + p1 * peak1_pdf + p2 * peak2_pdf + 
                 p3 * peak3_pdf + p4 * peak4_pdf
  
  # Return negative log-likelihood for minimization
  -sum(log(mixture_pdf + 1e-10))
}
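The behavior of a likelihood of this form can be sanity-checked on simulated data before running the full pipeline. A self-contained sketch with a single peak (re-inlined here rather than calling the function above; generating parameters are made up for the demonstration):

```r
# Sanity check: a one-peak version of the mixture negative log-likelihood,
# evaluated on simulated data with a known peak at 125 aa.
set.seed(1)
nll_one_peak <- function(mu, sigma, p1, x) {
  dens <- (1 - p1) * dgamma(x, shape = 1.5, scale = 200) +
          p1 * dnorm(x, mean = mu, sd = sigma)
  -sum(log(dens + 1e-10))  # same small-constant guard as the full model
}

# Simulate: 80% gamma background + 20% peak at length 125
x <- c(rgamma(800, shape = 1.5, scale = 200), rnorm(200, mean = 125, sd = 12))

# The generating peak position should score better (lower NLL) than a
# displaced one
nll_one_peak(125, 12, 0.2, x) < nll_one_peak(400, 12, 0.2, x)
```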

# Log-likelihood for background-only model (gamma distribution)
log_likelihood_background <- function(params, lengths) {
  alpha <- params[1]
  beta <- params[2]
  
  background_pdf <- dgamma(lengths, shape = alpha, scale = beta)
  -sum(log(background_pdf + 1e-10))
}

## STATISTICAL TESTING FRAMEWORK: Extensions Beyond OMICS 2002
## Original paper used likelihood ratio test for significance testing
## Modern implementation adds:
## 1. AIC/BIC model comparison (information-theoretic criteria)
## 2. Bootstrap confidence intervals (uncertainty quantification)
## 3. Robust optimization with convergence diagnostics
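The likelihood ratio test and AIC comparison listed above reduce to a few lines of arithmetic. A sketch using made-up negative log-likelihoods (the values below are illustrative, not the fitted results):

```r
# Suppose optim() returned these negative log-likelihoods (illustrative values)
nll_background <- 12500.0  # background-only model, 2 parameters
nll_mixture    <- 12380.0  # full mixture model, 8 parameters

# Likelihood ratio test: chi-squared with df = difference in parameter count
lrt <- 2 * (nll_background - nll_mixture)
p_value <- pchisq(lrt, df = 8 - 2, lower.tail = FALSE)

# AIC = 2*NLL + 2*k; a positive difference favors the mixture model
aic_background <- 2 * nll_background + 2 * 2
aic_mixture    <- 2 * nll_mixture    + 2 * 8

c(LRT = lrt, delta_AIC = aic_background - aic_mixture)
# LRT = 240, delta_AIC = 228
```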

Step 10. - Sequential Dataset Processing - Statistical Analysis Team C

# Sequential Dataset Processing 
# Initialize storage for results
all_results_teamc <- list()

# Loop through each dataset
for (dataset_idx in 1:length(datasets)) {
  # Use the existing datasets and dataset_names vectors
  filename <- datasets[dataset_idx]
  variable_name <- dataset_names[dataset_idx]
  description <- paste("Dataset", dataset_idx, "-", variable_name)
  
  cat("\n## Dataset", dataset_idx, ":", variable_name, "\n\n")
  
  # Initialize results storage
  dataset_results <- list(
    variable_name = variable_name,
    filename = filename,
    description = description
  )
  
  tryCatch({
    # Load data - Use the already loaded and standardized data
    cat("Using pre-loaded and standardized data for", variable_name, "...\n")
    
    # Get the appropriate filtered dataset from your existing results
    analysis_data <- all_results[[dataset_idx]]$filtered_proteins
    
    # Store initial row count
    initial_row_count <- nrow(analysis_data)
    
    # The data is already filtered for 50-600 aa in your pipeline, so use it directly
    enzyme_lengths <- analysis_data$length_aa
    
    cat("Using", length(enzyme_lengths), "valid sequences (50-600 aa)\n")
    
    # Store basic info including initial count
    dataset_results$initial_row_count <- initial_row_count
    dataset_results$n_sequences <- length(enzyme_lengths)
    dataset_results$mean_length <- mean(enzyme_lengths)
    dataset_results$median_length <- median(enzyme_lengths)
    dataset_results$sd_length <- sd(enzyme_lengths)
    dataset_results$min_length <- min(enzyme_lengths)
    dataset_results$max_length <- max(enzyme_lengths)
    
    
    if (length(enzyme_lengths) > 0) {
      
      ## MIXTURE MODEL ANALYSIS: Statistical Framework Implementation
      ## OMICS 2002 introduced mixture model for rigorous statistical testing
      ## Modern implementation extends with robust optimization and uncertainty quantification
      
      # MIXTURE MODEL ANALYSIS
      cat("Fitting mixture model...\n")
      
      ## PARAMETER INITIALIZATION: Based on OMICS 2002 Estimates
      ## Original study reported: μ ≈ 125 aa, σ ≈ 12 aa
      ## Gamma parameters estimated from background distribution characteristics
      ## Mixing weights initialized based on observed peak prominence
      
      # Parameter setup
      initial_params <- c(1.5, 200, 120, 12, 0.05, 0.1, 0.08, 0.07)
      lower_bounds <- c(0.1, 50, 100, 5, 0, 0, 0, 0)
      upper_bounds <- c(10, 500, 150, 20, 0.3, 0.3, 0.3, 0.3)
      
      ## OPTIMIZATION STRATEGY: Modern Maximum Likelihood Implementation
      ## METHODOLOGICAL EXTENSION: L-BFGS-B algorithm with box constraints
      ## provides more robust parameter estimation than methods available in 2002
      
      # Fit full mixture model
      mixture_result <- optim(
        par = initial_params,
        fn = log_likelihood_mixture,
        lengths = enzyme_lengths,
        method = "L-BFGS-B",
        lower = lower_bounds,
        upper = upper_bounds,
        control = list(maxit = 1000)
      )
      
      # Fit background-only model
      background_result <- optim(
        par = initial_params[1:2],
        fn = log_likelihood_background,
        lengths = enzyme_lengths,
        method = "L-BFGS-B",
        lower = lower_bounds[1:2],
        upper = upper_bounds[1:2]
      )
      
      # Extract parameters
      opt_params <- mixture_result$par
      names(opt_params) <- c("alpha", "beta", "mu", "sigma", "p1", "p2", "p3", "p4")
      
      # Calculate derived parameters
      alpha <- opt_params["alpha"]
      beta <- opt_params["beta"] 
      mu <- opt_params["mu"]
      sigma <- opt_params["sigma"]
      p1 <- opt_params["p1"]
      p2 <- opt_params["p2"]
      p3 <- opt_params["p3"]
      p4 <- opt_params["p4"]
      p0 <- 1 - (p1 + p2 + p3 + p4)
      
      cat("Full model optimization completed (convergence:", mixture_result$convergence == 0, ")\n")
      cat("Background model optimization completed (convergence:", background_result$convergence == 0, ")\n")
      
      ## STATISTICAL TESTING: Likelihood Ratio Test (OMICS 2002) + Modern Extensions
      ## Original methodology: χ² test with (k+2) degrees of freedom
      ## Modern additions: AIC/BIC for model comparison, robust uncertainty quantification
      
      # Statistical tests
      lrt_statistic <- 2 * (background_result$value - mixture_result$value)
      # lower.tail = FALSE avoids the underflow of 1 - pchisq() for large statistics
      p_value <- pchisq(lrt_statistic, df = 6, lower.tail = FALSE)
      
      n <- length(enzyme_lengths)
      n_params_full <- 8
      n_params_background <- 2
      
      ## MODEL SELECTION CRITERIA: Information-Theoretic Extensions
      ## AIC/BIC provide alternative to pure significance testing
      ## These criteria were not available/standard in 2002 bioinformatics
      
      AIC_full <- 2 * mixture_result$value + 2 * n_params_full
      AIC_background <- 2 * background_result$value + 2 * n_params_background
      
      BIC_full <- 2 * mixture_result$value + n_params_full * log(n)
      BIC_background <- 2 * background_result$value + n_params_background * log(n)
      
      cat("\nStatistical Testing:\n")
      cat("LRT statistic:", round(lrt_statistic, 2), "\n")
      cat("P-value:", ifelse(p_value < 2.2e-16, "< 2.2e-16", sprintf("%.2e", p_value)), "\n")
      cat("AIC improvement:", round(AIC_background - AIC_full, 1), "\n")
      cat("BIC improvement:", round(BIC_background - BIC_full, 1), "\n")
      
      ## BOOTSTRAP ANALYSIS: Modern Uncertainty Quantification
      ## MAJOR EXTENSION BEYOND OMICS 2002: Bootstrap confidence intervals
      ## Provides robust uncertainty estimates for parameter estimates
      ## Not available in original computational framework (2002 limitations)
      
      # Bootstrap Analysis
      cat("\nRunning bootstrap analysis for confidence intervals...\n")
      
      n_bootstrap <- 100
      bootstrap_params <- matrix(NA, nrow = n_bootstrap, ncol = 8)
      
      set.seed(42)
      for (i in 1:n_bootstrap) {
        if (i %% 20 == 0) cat("  Bootstrap iteration", i, "of", n_bootstrap, "\n")
        
        # Bootstrap sample
        sample_indices <- sample(length(enzyme_lengths), replace = TRUE)
        bootstrap_sample <- enzyme_lengths[sample_indices]
        
        # Fit model to bootstrap sample
        tryCatch({
          boot_result <- optim(
            par = opt_params,
            fn = log_likelihood_mixture,
            lengths = bootstrap_sample,
            method = "L-BFGS-B",
            lower = lower_bounds,
            upper = upper_bounds,
            control = list(maxit = 200)
          )
          
          if (boot_result$convergence == 0) {
            bootstrap_params[i, ] <- boot_result$par
          } else {
            bootstrap_params[i, ] <- opt_params
          }
        }, error = function(e) {
          # `<<-` is required: a plain `<-` inside this handler would modify
          # only a local copy of bootstrap_params, leaving the row as NA
          bootstrap_params[i, ] <<- opt_params
        })
      }
      
      # Calculate confidence intervals
      valid_boots <- complete.cases(bootstrap_params)
      n_valid_boots <- sum(valid_boots)
      
      if (n_valid_boots >= 10) {
        lower_ci <- apply(bootstrap_params[valid_boots, ], 2, quantile, probs = 0.025, na.rm = TRUE)
        upper_ci <- apply(bootstrap_params[valid_boots, ], 2, quantile, probs = 0.975, na.rm = TRUE)
      } else {
        warning("Insufficient successful bootstrap iterations")
        lower_ci <- upper_ci <- opt_params
      }
      
      cat("Bootstrap completed:", n_valid_boots, "successful iterations\n")
      
      ## PARAMETER INTERPRETATION: Biological Significance
      ## μ (base period): Fundamental domain unit size
      ## σ (period standard deviation): Variability in domain organization
      ## p_i (mixing weights): Relative frequency of i-domain proteins
      ## Bootstrap CIs quantify uncertainty in these biological parameters
      
      # Parameter Estimates Table
      param_names <- c("α (gamma shape)", "β (gamma scale)", "μ (base period)", "σ (period std)", 
                       "p₁ (1× weight)", "p₂ (2× weight)", "p₃ (3× weight)", "p₄ (4× weight)")
      
      parameter_estimates <- data.frame(
        Parameter = param_names,
        Estimate = round(opt_params, 4),
        Lower_CI = round(lower_ci, 4),
        Upper_CI = round(upper_ci, 4),
        CI_Width = round(upper_ci - lower_ci, 4)
      )
      
      # keeping RMD concise - there is a summary table for all four datasets at the end.
      
      # print(kable(parameter_estimates,
      #            caption = paste0("Parameter Estimates with 95% CIs - ", variable_name),
      #            col.names = c("Parameter", "Estimate", "Lower CI", "Upper CI", "CI Width")) %>%
      #        kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
      #                      full_width = FALSE) %>%
      #        row_spec(0, bold = TRUE, background = "#f8f9fa") %>%
      #        pack_rows("Background Distribution", 1, 2, label_row_css = "background-color: #e3f2fd;") %>%
      #        pack_rows("Periodic Structure", 3, 4, label_row_css = "background-color: #f3e5f5;") %>%
      #        pack_rows("Mixing Weights", 5, 8, label_row_css = "background-color: #e8f5e8;"))
      
      ## MIXTURE MODEL VISUALIZATION: Enhanced Implementation
      ## OMICS 2002 Figure 3 inspired visualization with modern enhancements
      ## Shows empirical data, estimated model components, and full mixture fit
      
      # Individual Mixture Model Visualization
      cat("\n### Mixture Model Visualization -", variable_name, "\n\n")
      
      # Prepare data for plotting
      x_range <- seq(50, 600, length.out = 551)
      
      # Calculate component PDFs
      background_pdf <- dgamma(x_range, shape = alpha, scale = beta)
      peak1_pdf <- dnorm(x_range, mean = mu, sd = sigma)
      peak2_pdf <- dnorm(x_range, mean = 2*mu, sd = sqrt(2)*sigma)
      peak3_pdf <- dnorm(x_range, mean = 3*mu, sd = sqrt(3)*sigma)
      peak4_pdf <- dnorm(x_range, mean = 4*mu, sd = sqrt(4)*sigma)
      
      # Create mixture PDF
      mixture_pdf <- p0 * background_pdf + p1 * peak1_pdf + p2 * peak2_pdf + 
                     p3 * peak3_pdf + p4 * peak4_pdf
      
      # Create histogram data
      hist_data <- hist(enzyme_lengths, breaks = seq(50, 600, by = 10), plot = FALSE)
      hist_density <- hist_data$density
      hist_centers <- hist_data$mids
      
      # Enhanced visualization
      par(mar = c(8, 5, 4, 2), bg = "white")
      
      # Plot histogram
      plot(hist_centers, hist_density, type = "h", lwd = 6, col = "lightblue", 
           xlim = c(50, 600), ylim = c(0, max(hist_density) * 1.1),
           main = paste("Mixture Model -", variable_name),
           xlab = "Sequence Length (amino acids)", ylab = "Density",
           cex.main = 1.3, cex.lab = 1.2, cex.axis = 1.1)
      
      # Add model components
      lines(x_range, p0 * background_pdf, col = "red", lwd = 3, lty = 2)
      if(p1 > 0.001) lines(x_range, p1 * peak1_pdf, col = "green", lwd = 2)
      if(p2 > 0.001) lines(x_range, p2 * peak2_pdf, col = "blue", lwd = 2)
      if(p3 > 0.001) lines(x_range, p3 * peak3_pdf, col = "magenta", lwd = 2)
      if(p4 > 0.001) lines(x_range, p4 * peak4_pdf, col = "cyan", lwd = 2)
      lines(x_range, mixture_pdf, col = "black", lwd = 3)
      
      # Add vertical lines at significant peak positions
      if(p1 > 0.001) abline(v = mu, col = "green", lty = 3, lwd = 2)
      if(p2 > 0.001) abline(v = 2*mu, col = "blue", lty = 3, lwd = 2)
      if(p3 > 0.001) abline(v = 3*mu, col = "magenta", lty = 3, lwd = 2)
      if(p4 > 0.001) abline(v = 4*mu, col = "cyan", lty = 3, lwd = 2)
      
      # Legend
      legend_items <- c("Data", "Background", "Full Model")
      legend_colors <- c("lightblue", "red", "black")
      legend_lty <- c(1, 2, 1)
      legend_lwd <- c(6, 3, 3)
      
      if(p1 > 0.001) {
        legend_items <- c(legend_items, paste0(round(mu), " aa (1×)"))
        legend_colors <- c(legend_colors, "green")
        legend_lty <- c(legend_lty, 1)
        legend_lwd <- c(legend_lwd, 2)
      }
      
      if(p2 > 0.001) {
        legend_items <- c(legend_items, paste0(round(2*mu), " aa (2×)"))
        legend_colors <- c(legend_colors, "blue")
        legend_lty <- c(legend_lty, 1)
        legend_lwd <- c(legend_lwd, 2)
      }
      if(p3 > 0.001) {
        legend_items <- c(legend_items, paste0(round(3*mu), " aa (3×)"))
        legend_colors <- c(legend_colors, "magenta") 
        legend_lty <- c(legend_lty, 1)
        legend_lwd <- c(legend_lwd, 2)
      }
      if(p4 > 0.001) {
        legend_items <- c(legend_items, paste0(round(4*mu), " aa (4×)"))
        legend_colors <- c(legend_colors, "cyan")
        legend_lty <- c(legend_lty, 1)
        legend_lwd <- c(legend_lwd, 2)
      }
      
      legend("topright", legend = legend_items, col = legend_colors,
             lty = legend_lty, lwd = legend_lwd, cex = 1.0, bg = "white")
      
      # Format p-value
      p_value_text <- if(p_value < 2.2e-16) {
        "< 2.2e-16"
      } else {
        format(p_value, scientific = TRUE, digits = 2)
      }
      
      # Statistical annotation
      mtext(paste0("Base Period: ", round(mu, 1), " ± ", round(sigma, 1), " aa  |  ",
                   "LRT p-value: ", p_value_text, "  |  ",
                   "Bootstrap CI: ", round(lower_ci[3], 1), " - ", round(upper_ci[3], 1), " aa"),
            side = 1, line = 6, cex = 1.0, font = 2)
      
      grid(col = "gray90", lty = "dotted")
      
      # Store mixture model results (expanded)
      dataset_results$mixture_model <- list(
        converged = mixture_result$convergence == 0,
        alpha = as.numeric(alpha),
        beta = as.numeric(beta),
        mu = as.numeric(mu),
        sigma = as.numeric(sigma),
        p0 = as.numeric(p0),
        p1 = as.numeric(p1),
        p2 = as.numeric(p2),
        p3 = as.numeric(p3),
        p4 = as.numeric(p4),
        lrt_statistic = lrt_statistic,
        p_value = p_value,
        AIC_delta = AIC_background - AIC_full,
        BIC_delta = BIC_background - BIC_full,
        bootstrap = list(
          n_successful = n_valid_boots,
          lower_ci = lower_ci,
          upper_ci = upper_ci
        ),
        # Store plot data for composite
        hist_centers = hist_centers,
        hist_density = hist_density,
        x_range = x_range,
        mixture_pdf = mixture_pdf,
        background_pdf = background_pdf,
        peak1_pdf = peak1_pdf,
        peak2_pdf = peak2_pdf,
        peak3_pdf = peak3_pdf,
        peak4_pdf = peak4_pdf
      )
      
      cat("Analysis completed for", variable_name, "\n")
      
    } else {
      dataset_results$error <- "No valid data"
    }
    
  }, error = function(e) {
    cat("ERROR loading", filename, ":", e$message, "\n")
    dataset_results$error <- e$message
  })
  
  # Store results
  all_results_teamc[[dataset_idx]] <- dataset_results
}
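The loop above fits a gamma background plus Gaussian peaks centered at integer multiples of the base period. As a standalone sketch of the density being evaluated (the function name `mixture_density` and the parameter values below are illustrative, not taken from the fitted results):

```r
# Five-component mixture density: gamma background plus Gaussian peaks at
# 1x..4x the base period mu, with peak sd growing as sqrt(k) * sigma,
# mirroring the component PDFs constructed in the loop.
mixture_density <- function(x, alpha, beta, mu, sigma,
                            p0, p1, p2, p3, p4) {
  p0 * dgamma(x, shape = alpha, scale = beta) +
    p1 * dnorm(x, mean = 1 * mu, sd = sqrt(1) * sigma) +
    p2 * dnorm(x, mean = 2 * mu, sd = sqrt(2) * sigma) +
    p3 * dnorm(x, mean = 3 * mu, sd = sqrt(3) * sigma) +
    p4 * dnorm(x, mean = 4 * mu, sd = sqrt(4) * sigma)
}

# Sanity check: with mixing weights summing to 1, the density should
# integrate to ~1 (parameter values here are illustrative only).
x    <- seq(0, 2000, by = 0.5)
dens <- mixture_density(x, alpha = 4, beta = 60, mu = 128, sigma = 20,
                        p0 = 0.52, p1 = 0.01, p2 = 0.01, p3 = 0.18, p4 = 0.28)
total <- sum(dens) * 0.5  # crude numeric integral over [0, 2000]
```

Note that each peak's standard deviation scales with √k, so higher multiples of the base period are progressively broader components.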

Dataset 1: Dataset_1

Using pre-loaded and standardized data for Dataset_1 …
Using 14685 valid sequences (50-600 aa)
Fitting mixture model…
Full model optimization completed (convergence: TRUE)
Background model optimization completed (convergence: TRUE)

Statistical Testing:
LRT statistic: 2710.89
P-value: < 2.2e-16
AIC improvement: 2698.9
BIC improvement: 2653.3

Running bootstrap analysis for confidence intervals…
Bootstrap iteration 20 of 100
Bootstrap iteration 40 of 100
Bootstrap iteration 60 of 100
Bootstrap iteration 80 of 100
Bootstrap iteration 100 of 100
Bootstrap completed: 97 successful iterations

Mixture Model Visualization - Dataset_1

Analysis completed for Dataset_1

Dataset 2: Dataset_2

Using pre-loaded and standardized data for Dataset_2 …
Using 13328 valid sequences (50-600 aa)
Fitting mixture model…
Full model optimization completed (convergence: TRUE)
Background model optimization completed (convergence: TRUE)

Statistical Testing:
LRT statistic: 2513.56
P-value: < 2.2e-16
AIC improvement: 2501.6
BIC improvement: 2456.6

Running bootstrap analysis for confidence intervals…
Bootstrap iteration 20 of 100
Bootstrap iteration 40 of 100
Bootstrap iteration 60 of 100
Bootstrap iteration 80 of 100
Bootstrap iteration 100 of 100
Bootstrap completed: 100 successful iterations

Mixture Model Visualization - Dataset_2

Analysis completed for Dataset_2

Dataset 3: Dataset_3

Using pre-loaded and standardized data for Dataset_3 …
Using 18985 valid sequences (50-600 aa)
Fitting mixture model…
Full model optimization completed (convergence: TRUE)
Background model optimization completed (convergence: TRUE)

Statistical Testing:
LRT statistic: 3951.86
P-value: < 2.2e-16
AIC improvement: 3939.9
BIC improvement: 3892.7

Running bootstrap analysis for confidence intervals…
Bootstrap iteration 20 of 100
Bootstrap iteration 40 of 100
Bootstrap iteration 60 of 100
Bootstrap iteration 80 of 100
Bootstrap iteration 100 of 100
Bootstrap completed: 100 successful iterations

Mixture Model Visualization - Dataset_3

Analysis completed for Dataset_3

Dataset 4: Dataset_4

Using pre-loaded and standardized data for Dataset_4 …
Using 13980 valid sequences (50-600 aa)
Fitting mixture model…
Full model optimization completed (convergence: TRUE)
Background model optimization completed (convergence: TRUE)

Statistical Testing:
LRT statistic: 2587.59
P-value: < 2.2e-16
AIC improvement: 2575.6
BIC improvement: 2530.3

Running bootstrap analysis for confidence intervals…
Bootstrap iteration 20 of 100
Bootstrap iteration 40 of 100
Bootstrap iteration 60 of 100
Bootstrap iteration 80 of 100
Bootstrap iteration 100 of 100
Bootstrap completed: 100 successful iterations

Mixture Model Visualization - Dataset_4

Analysis completed for Dataset_4

# Filter successful analyses
successful_analyses <- all_results_teamc[sapply(all_results_teamc, function(x) is.null(x$error))]
n_successful <- length(successful_analyses)

cat("\n## Processing Summary\n")

Processing Summary

cat("Successfully analyzed", n_successful, "out of", length(datasets), "datasets\n")

Successfully analyzed 4 out of 4 datasets

Step 11. - Summary of Mixture Modeling Results -

Detailed Mixture Model Results with Bootstrap Confidence Intervals
| Group | Dataset | N | Base Period | 95% CI | LRT χ² | P-value | ΔAIC | ΔBIC | Bootstrap | p₀ | p₁ | p₂ | p₃ | p₄ |
|-------|---------|---|-------------|--------|--------|---------|------|------|-----------|----|----|----|----|----|
| Control Group | Dataset_1 | 14685 | 128 ± 20 | 127.6 - 128.4 | 2710.89 | < 2.2e-16 | 2698.9 | 2653.3 | 97/100 (97%) | 52.5% (49.1 - 55.7%) | 0.5% (0 - 0.8%) | 0.5% (0 - 1.4%) | 17.8% (16.5 - 19%) | 28.7% (27.8 - 29.7%) |
| Control Group | Dataset_2 | 13328 | 127.9 ± 20 | 127.3 - 128.3 | 2513.56 | < 2.2e-16 | 2501.6 | 2456.6 | 100/100 (100%) | 51.9% (48 - 55.2%) | 0.8% (0.4 - 1.1%) | 0.6% (0 - 1.9%) | 17.8% (16.5 - 19.1%) | 28.9% (28 - 29.9%) |
| Independent Datasets | Dataset_3 | 18985 | 127.7 ± 20 | 127.4 - 128.3 | 3951.86 | < 2.2e-16 | 3939.9 | 3892.7 | 100/100 (100%) | 51.5% (49.1 - 53.6%) | 0.3% (0 - 0.7%) | 0% (0 - 0.1%) | 19.4% (18.4 - 20.6%) | 28.8% (28 - 29.5%) |
| Independent Datasets | Dataset_4 | 13980 | 127.7 ± 20 | 127.1 - 128.4 | 2587.59 | < 2.2e-16 | 2575.6 | 2530.3 | 100/100 (100%) | 51.9% (48.3 - 55.5%) | 0% (0 - 0%) | 0.7% (0 - 1.9%) | 18.4% (16.7 - 19.9%) | 28.9% (27.8 - 29.9%) |
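As a consistency check on the table: assuming both models are fit by maximum likelihood and the full model adds six free parameters (μ, σ, and the four peak weights p₁-p₄) over the gamma-only background, the ΔAIC and ΔBIC columns follow arithmetically from the LRT statistic. A quick sketch for Dataset_1, using values read from the table:

```r
# dAIC = LRT - 2 * k_extra and dBIC = LRT - k_extra * log(n), where
# k_extra is the number of extra parameters in the full model (here 6:
# mu, sigma, and the four peak weights). Values read from the table.
lrt     <- 2710.89
n       <- 14685
k_extra <- 6
dAIC <- lrt - 2 * k_extra        # ~2698.9, matching the dAIC column
dBIC <- lrt - k_extra * log(n)   # ~2653.3, matching the dBIC column
```

The same arithmetic reproduces the ΔAIC and ΔBIC entries for the other three datasets, confirming the reported improvements are internally consistent.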

Step 12. - R Session Information -

# Document the computational environment for reproducibility
cat("Analysis completed on:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "\n\n")
## Analysis completed on: 2025-09-16 14:43:16 EDT
cat("R Session Information:\n")
## R Session Information:
cat("===================\n")
## ===================
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.3.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.3.1  kableExtra_1.4.0 knitr_1.50       lubridate_1.9.4 
##  [5] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.1.0     
##  [9] readr_2.1.5      tidyr_1.3.1      tibble_3.3.0     ggplot2_3.5.2   
## [13] tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     xml2_1.3.8         stringi_1.8.7     
##  [5] hms_1.1.3          digest_0.6.37      magrittr_2.0.3     evaluate_1.0.4    
##  [9] grid_4.4.1         timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0     
## [13] jsonlite_2.0.0     viridisLite_0.4.2  scales_1.4.0       textshaping_1.0.1 
## [17] jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6        crayon_1.5.3      
## [21] bit64_4.6.0-1      withr_3.0.2        cachem_1.1.0       yaml_2.3.10       
## [25] tools_4.4.1        parallel_4.4.1     tzdb_0.5.0         vctrs_0.6.5       
## [29] R6_2.6.1           lifecycle_1.0.4    bit_4.6.0          vroom_1.6.5       
## [33] pkgconfig_2.0.3    pillar_1.11.0      bslib_0.9.0        gtable_0.3.6      
## [37] glue_1.8.0         systemfonts_1.2.3  xfun_0.52          tidyselect_1.2.1  
## [41] rstudioapi_0.17.1  farver_2.1.2       htmltools_0.5.8.1  labeling_0.4.3    
## [45] rmarkdown_2.29     svglite_2.2.1      compiler_4.4.1